1 Where Do People Drink The Most Beer, Wine And Spirits?

Back in 2014, fivethirtyeight.com published an article on alcohol consumption in different countries. The drinks dataset is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.

library(fivethirtyeight)
data(drinks)


# or download directly
# alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")

What are the variable types? Any missing values we should worry about?

The variable types are character and numeric (integer and double). There are no missing values.

glimpse(drinks)
## Rows: 193
## Columns: 5
## $ country                      <chr> "Afghanistan", "Albania", "Algeria", "And~
## $ beer_servings                <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2~
## $ spirit_servings              <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,~
## $ wine_servings                <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191~
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, ~
skimr::skim(drinks)
Data summary
Name drinks
Number of rows 193
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 28 0 193 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
beer_servings 0 1 106.16 101.14 0 20.0 76.0 188.0 376.0 ▇▃▂▂▁
spirit_servings 0 1 80.99 88.28 0 4.0 56.0 128.0 438.0 ▇▃▂▁▁
wine_servings 0 1 49.45 79.70 0 1.0 8.0 59.0 370.0 ▇▁▁▁▁
total_litres_of_pure_alcohol 0 1 4.72 3.77 0 1.3 4.2 7.2 14.4 ▇▃▅▃▁

1.1 Make a plot that shows the top 25 beer consuming countries

drinks %>% 
  slice_max(order_by = beer_servings, n=25) %>% 
  ggplot(aes(x = beer_servings, y = reorder(country,beer_servings))) +
  geom_col(fill = "chocolate")+
  theme_bw()+
  labs(
    title = "Top 25 beer consuming countries",
    subtitle = "Bar Chart",
    x = "Beer Servings",
    y = NULL
  )

Comment on top 25 beer consuming countries:

The African countries Namibia and Gabon exhibit especially high beer consumption, which may reflect their colonial histories: Namibia was a German colony until the early 20th century, and Gabon a French one. Both countries also have relatively small populations (around 2 million) and relatively low life expectancies (around 65 years), so their per-capita figures are driven by a comparatively young population, which may drink more beer than an older one.

1.2 Make a plot that shows the top 25 wine consuming countries

drinks %>% 
  slice_max(order_by = wine_servings, n=25) %>% 
  ggplot(aes(x = wine_servings, y = fct_reorder(country,wine_servings))) +
  geom_col(fill = "darkred")+
  theme_bw()+
  labs(
    title = "Top 25 wine consuming countries",
    subtitle = "Bar Chart",
    x = "Wine Servings",
    y = NULL
  )

Comment on top 25 wine consuming countries:

Andorra, a country of only 77,000 people, borders France, the highest wine consuming country. Equatorial Guinea has a history of Portuguese and later Spanish colonial rule, which may also help explain its high wine consumption.

1.3 Finally, make a plot that shows the top 25 spirit consuming countries

drinks %>% 
  slice_max(order_by = spirit_servings, n=25) %>% 
  ggplot(aes(x = spirit_servings, y = fct_reorder(country,spirit_servings))) +
  geom_col(fill = "White", colour = "black")+
  theme_bw()+
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  labs(
    title = "Top 25 spirit consuming countries",
    subtitle = "Bar Chart",
    x = "Spirit servings",
    y = NULL
  )

Comment on top 25 spirit consuming countries:

Grenada passed a Liquor Licence Act in 1901 that allowed citizens to consume unlimited amounts of alcohol legally, and this has had a profound impact on the country's drinking culture; many locals argue that more regulation is needed. Many of the countries in the top 25 are in the Caribbean, which has a strong spirits culture centred on its rum-manufacturing tradition.

What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).

TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.

What we can infer from these plots is that alcohol consumption often tracks cultural and historical factors. For example, Namibia, a former German colony, tops the beer list; France, with its long winemaking tradition, leads wine consumption; and Grenada, the top spirit consumer, sits in the Caribbean with its long history of rum production. What is interesting is that very few countries appear in the top 25 for more than one of beer, wine, or spirits: countries that rank highly in one type of alcohol tend to consume comparatively little of the others.
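The overlap claim can be checked directly. A minimal sketch, assuming the drinks data frame from the fivethirtyeight package loaded above:

```r
library(dplyr)
library(tidyr)
library(fivethirtyeight)

# count how often each country appears across the three top-25 lists
drinks %>%
  pivot_longer(ends_with("_servings"),
               names_to = "drink", values_to = "servings") %>%
  group_by(drink) %>%
  slice_max(servings, n = 25) %>%   # top 25 per beverage (ties may add rows)
  ungroup() %>%
  count(country, name = "lists_appeared_in") %>%
  filter(lists_appeared_in > 1) %>%
  arrange(desc(lists_appeared_in))
```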

2 Analysis of movies- IMDB dataset

We will look at a sample of movies taken from the Kaggle IMDB 5000 movie dataset.

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge~
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "~
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow~
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20~
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1~
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, ~
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, ~
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920~
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9~
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35~
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, ~
skimr::skim(movies)
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of facebook likes cast members received
  • votes: the number of people who voted for (or rated) the movie on IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating

2.1 Use your data import, inspection, and cleaning skills to answer the following:

Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?

There are no missing values, but there are some duplicate entries. That is why we create the data frame unique_movies, which keeps only distinct title/genre/director/year combinations, and use it for the rest of the analysis.
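Both claims can be verified quickly. A minimal sketch, assuming movies has been read in as above:

```r
library(dplyr)

# total NA count across all columns; 0, per the skim output above
sum(is.na(movies))

# number of rows that duplicate an existing title/genre/director/year combination
nrow(movies) - nrow(distinct(movies, title, genre, director, year))
```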

  • Produce a table with the count of movies by genre, ranked in descending order

    unique_movies <- movies %>% 
      distinct(title, genre, director, year, .keep_all = TRUE)
    
    movies_count_genre <- unique_movies %>% 
      group_by(genre) %>% 
      count(sort = TRUE)
    
    head(movies_count_genre)
    ## # A tibble: 6 x 2
    ## # Groups:   genre [6]
    ##   genre         n
    ##   <chr>     <int>
    ## 1 Comedy      844
    ## 2 Action      719
    ## 3 Drama       484
    ## 4 Adventure   281
    ## 5 Crime       198
    ## 6 Biography   135

Comment:

Genres that cater to a wider crowd tend to have more demand and therefore more are produced (i.e. comedy and action films). Other genres such as crime, biography and drama tend to require more specific interests which cater to a smaller crowd and thus garner less demand.

  • Produce a table with the average gross earning and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ a movie made at the box office for each $ of its budget. Rank genres by this return_on_budget in descending order

    returns_movies <- unique_movies %>% 
      group_by(genre)%>%
      summarise(mean_gross = mean(gross),
                mean_budget = mean(budget)
                ) %>% 
      mutate(return_on_budget = (mean_gross - mean_budget) / mean_budget) %>% # net $ returned per $ of budget
      arrange(-return_on_budget)
    head(returns_movies)
    ## # A tibble: 6 x 4
    ##   genre       mean_gross mean_budget return_on_budget
    ##   <chr>            <dbl>       <dbl>            <dbl>
    ## 1 Musical      92084000     3189500             27.9 
    ## 2 Family      149160478.   14833333.             9.06
    ## 3 Western      20821884     3465000              5.01
    ## 4 Documentary  17353973.    5887852.             1.95
    ## 5 Horror       37782310.   13804379.             1.74
    ## 6 Fantasy      41902674.   18484615.             1.27

Comment:

Musicals have by far the highest return on budget. Their mean budget tends to be very low compared with genres such as horror and fantasy, which carry higher mean budgets because they require special effects on top of live actors. Musicals also share many elements with family films, which lets them draw in a wider audience.
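The headline Musical figure is a simple ratio and can be sanity-checked by hand against the table above:

```r
# return_on_budget = (mean_gross - mean_budget) / mean_budget for Musicals
92084000 / 3189500 - 1   # ~27.9, matching the table above
```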

  • Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.

    top_directors<- unique_movies %>% 
      group_by(director) %>% 
      summarise(
                sum_gross=sum(gross),
                mean_gross = mean(gross),
                median_gross = median(gross),
                sd_gross = sd(gross)) %>% 
      arrange(desc(sum_gross))
    
    head(top_directors, 15)
    ## # A tibble: 15 x 5
    ##    director           sum_gross mean_gross median_gross   sd_gross
    ##    <chr>                  <dbl>      <dbl>        <dbl>      <dbl>
    ##  1 Steven Spielberg  4014061704 174524422.   164435221  101421051.
    ##  2 Michael Bay       2195443511 182953626.   168468240. 125789167.
    ##  3 James Cameron     1909725910 318287652.   175562880. 309171337.
    ##  4 Christopher Nolan 1813227576 226653447    196667606. 187224133.
    ##  5 George Lucas      1741418480 348283696    380262555  146193880.
    ##  6 Robert Zemeckis   1619309108 124562239.   100853835   91300279.
    ##  7 Tim Burton        1557078534 111219895.    69791834   99304293.
    ##  8 Sam Raimi         1443167519 180395940.   138480208  174705230.
    ##  9 Clint Eastwood    1378321100  72543216.    46700000   75487408.
    ## 10 Francis Lawrence  1358501971 271700394.   281666058  135437020.
    ## 11 Ron Howard        1335988092 111332341    101587923   81933761.
    ## 12 Gore Verbinski    1329600995 189942999.   123207194  154473822.
    ## 13 Andrew Adamson    1137446920 284361730    279680930. 120895765.
    ## 14 Shawn Levy        1129750988 102704635.    85463309   65484773.
    ## 15 Ridley Scott      1128857598  80632686.    47775715   68812285.

Comment:

Steven Spielberg has a much higher total gross revenue than the other directors. This can be attributed to the fact that he directed many films that are part of bigger franchises, e.g. Jurassic Park, Indiana Jones and Jaws. James Cameron has a disproportionately high standard deviation because two of his films (Titanic and Avatar) grossed far more than his others.

  • Finally, ratings. Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.

    rating_distribution <- unique_movies %>% 
      group_by(genre) %>% 
      summarise(
                min_rating=min(rating),
                max_rating = max(rating),
                mean_rating = mean(rating),
                median_rating = median(rating),
                sd_rating = sd(rating)) 
     head(rating_distribution)
    ## # A tibble: 6 x 6
    ##   genre     min_rating max_rating mean_rating median_rating sd_rating
    ##   <chr>          <dbl>      <dbl>       <dbl>         <dbl>     <dbl>
    ## 1 Action           2.1        9          6.23           6.3     1.04 
    ## 2 Adventure        2.3        8.6        6.51           6.6     1.11 
    ## 3 Animation        4.5        8          6.65           6.9     0.968
    ## 4 Biography        4.5        8.9        7.11           7.2     0.760
    ## 5 Comedy           1.9        8.8        6.11           6.2     1.02 
    ## 6 Crime            4.8        9.3        6.92           6.9     0.853
    rating_histogram <- ggplot(unique_movies, aes(x = rating)) +
      geom_histogram() +
      theme_bw() +
      labs(title = "Rating Distribution",
           subtitle = "Histogram",
           x = "Rating",
           y = "Count"
      )
    
    rating_histogram

Comment:

The distribution of ratings is roughly bell-shaped but skewed to the left, as most movies fall within the 5-7.5 range with a longer tail of low-rated films. Biography and crime films have the highest mean ratings and the lowest standard deviations, while comedy and action sit lower with more spread. One plausible reason for the skew is that films expected to rate very poorly are less likely to be produced, or to make it into a dataset of released movies.
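The left skew can be confirmed numerically. A minimal sketch, assuming unique_movies from above; a mean below the median indicates a longer left tail:

```r
library(dplyr)

unique_movies %>%
  summarise(mean_rating   = mean(rating),
            median_rating = median(rating),
            skew_sign     = mean_rating - median_rating)  # negative => left skew
```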

2.2 Use ggplot to answer the following

  • Examine the relationship between gross and cast_facebook_likes. Produce a scatter plot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?
ggplot(unique_movies, aes(y = gross, x = cast_facebook_likes)) +
  geom_point()+
  geom_smooth()+
   theme_bw()+
  labs(
    title = "Relationship between Gross Earnings and Facebook Likes",
    subtitle = "Scatter Plot",
    y = "Gross Earnings (US$)",
    x = "Facebook Likes"
  )+
    scale_y_log10()+
    scale_x_log10()+
  NULL  

Comment:

There is no strong positive relationship between gross earnings and Facebook likes, so likes alone are unlikely to be a good predictor of box-office revenue: likes do not necessarily translate into people paying to watch the film. Nevertheless, the highest-grossing films do tend to have many Facebook likes, as they are very popular. We map Facebook likes to the x-axis and gross earnings to the y-axis because we are testing whether likes predict earnings: the explanatory variable goes on x, the response on y.
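The strength of the association can be summarised with a correlation coefficient. A minimal sketch, assuming unique_movies from above:

```r
# Pearson (linear) and Spearman (rank-based) correlations between
# gross earnings and cast Facebook likes
cor(unique_movies$gross, unique_movies$cast_facebook_likes)
cor(unique_movies$gross, unique_movies$cast_facebook_likes,
    method = "spearman")
```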

  • Examine the relationship between gross and budget. Produce a scatter plot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.
ggplot(unique_movies, aes(y = gross, x = budget)) +
  geom_point()+
  geom_smooth()+
   theme_bw()+
  labs(
    title = "Relationship between Gross Earnings and Budget",
    subtitle = "Scatter Plot",
    y = "Gross Earnings (US$)",
    x = "Budget"
  )+
  scale_y_log10()+
  scale_x_log10()+
  NULL 

Comment:

There is not a strong positive correlation between gross earnings and budget until the budget surpasses roughly $1 million; beyond that point the positive relationship strengthens. The sample of films with budgets under $1 million is also very small. Quite a few high-budget films flop at the box office, which is why the overall correlation is not stronger.
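The claimed change around the threshold can be checked directly. A minimal sketch, assuming unique_movies from above and the rough $1 million cut-off used in the comment:

```r
library(dplyr)

# correlation of log gross vs log budget, below and above the threshold
unique_movies %>%
  mutate(budget_band = if_else(budget < 1e6, "under $1m", "$1m and over")) %>%
  group_by(budget_band) %>%
  summarise(n = n(),
            correlation = cor(log10(gross), log10(budget)))
```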

  • Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?
ggplot(unique_movies, aes(x = rating, y = gross)) +
  geom_point()+
  facet_wrap(~genre)+
  geom_smooth()+
   theme_bw()+
  labs(
    title = "Relationship between Gross Earnings and Rating for each Genre",
    subtitle = "Faceted Scatter Plot",
    x = "Rating",
    y = "Gross Earnings (US$)"
  )+
  # scale_y_log10() +
  # scale_x_log10() +
  NULL 

Comment:

There is less correlation between gross earnings and ratings for the comedy and horror genres because people often watch these types of films either to have a laugh or be scared regardless of the quality of the film or the plot. On the other hand, action and adventure films show a clearer positive relationship, as people watch these movies based on their perceived quality. Biography films also exhibit an interesting dynamic: earnings rise with ratings up to a point, after which the highest-rated biographies earn less than expected. Similar patterns appear in mystery and sci-fi films, whose earnings peak in the middle of their ratings distributions. However, these apparent trends may simply reflect the small number of films in those genres.

3 Returns of financial stocks

3.1 We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.

library(tidyquant)
nyse <- read_csv(here::here("data","nyse.csv"))
glimpse(nyse)
## Rows: 508
## Columns: 6
## $ symbol        <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", "~
## $ name          <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie ~
## $ ipo_year      <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "1999~
## $ sector        <chr> "Health Care", "Consumer Durables", "Health Care", "Heal~
## $ industry      <chr> "Medical/Dental Instruments", "Electrical Products", "Ma~
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasdaq~
skimr::skim(nyse)
Data summary
Name nyse
Number of rows 508
Number of columns 6
_______________________
Column type frequency:
character 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
symbol 0 1 1 4 0 508 0
name 0 1 5 48 0 505 0
ipo_year 0 1 3 4 0 33 0
sector 0 1 6 21 0 12 0
industry 0 1 5 62 0 103 0
summary_quote 0 1 31 34 0 508 0
  • Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order
unique_companies <- nyse %>% 
    distinct(symbol, name,ipo_year,.keep_all = TRUE)

#divide by sector
companies_per_sector <- unique_companies %>% 
  group_by(sector) %>% 
  count(sort=TRUE) 
  companies_per_sector  
## # A tibble: 12 x 2
## # Groups:   sector [12]
##    sector                    n
##    <chr>                 <int>
##  1 Finance                  97
##  2 Consumer Services        79
##  3 Public Utilities         60
##  4 Capital Goods            45
##  5 Health Care              45
##  6 Energy                   42
##  7 Technology               40
##  8 Basic Industries         39
##  9 Consumer Non-Durables    31
## 10 Miscellaneous            12
## 11 Transportation           10
## 12 Consumer Durables         8
companies_per_sector_plot <- unique_companies %>% 
  group_by(sector) %>% 
  count(sort=TRUE) %>%
  ggplot(aes( x = n, y = reorder(sector,n)))+
  geom_col()+
  theme_bw()+
      labs(title = "Number of companies per Sector",
        subtitle = "Bar Chart",
        x = "Number of Companies",
        y = NULL)

companies_per_sector_plot

Comment:

The NYSE contains many more traditional industries (e.g. banks and insurance companies) which is why Finance and Consumer Services companies make up the highest number in the dataset. Many Technology companies instead choose to list on the NASDAQ, which explains why there are proportionately fewer Tech companies in this dataset.

3.2 Next, let’s choose the Dow Jones Industrial Average (DJIA) stocks and their ticker symbols and download some data. Besides the thirty stocks that make up the DJIA, we will also add SPY, an S&P 500 ETF (Exchange Traded Fund). The following code scrapes the Wikipedia page for the DJIA constituents and extracts their ticker symbols.

library(rvest)    # read_html(), html_nodes(), html_table()
library(janitor)  # clean_names()

djia_url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"


# get tables that exist on URL
tables <- djia_url %>% 
  read_html() %>% 
  html_nodes(css = "table")


# parse HTML tables into a dataframe called djia. 
# Use purrr::map() to create a list of all tables in URL
djia <- map(tables, . %>% 
               html_table(fill = TRUE) %>% 
               clean_names())


# constituents
table1 <- djia[[2]] %>% # the second table on the page contains the ticker symbols
  mutate(date_added = ymd(date_added),
         
         # if a stock is listed on NYSE, its symbol is, e.g., NYSE: MMM
         # We will get prices from yahoo finance which requires just the ticker
         
         # if symbol contains "NYSE*", the * being a wildcard
         # then we just drop the first 6 characters in that string
         ticker = ifelse(str_detect(symbol, "NYSE*"),
                          str_sub(symbol,7,11),
                          symbol)
         )

# we need a vector of strings with just the 30 tickers + SPY
tickers <- table1 %>% 
  select(ticker) %>% 
  pull() %>% # pull() extracts the column as a plain character vector
  c("SPY") # and lets us add SPY, the SP500 ETF
  • Now let us download prices for all 30 DJIA constituents and the SPY ETF that tracks the S&P 500, since January 1, 2000
# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

myStocks <- tickers %>% 
  tq_get(get  = "stock.prices",
         from = "2000-01-01",
         to   = Sys.Date()) %>% # Sys.Date() returns today's date
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 161,491
## Columns: 8
## Groups: symbol [31]
## $ symbol   <chr> "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM"~
## $ date     <date> 2000-01-03, 2000-01-04, 2000-01-05, 2000-01-06, 2000-01-07, ~
## $ open     <dbl> 48.0, 46.4, 45.6, 47.2, 50.6, 50.2, 50.4, 51.0, 50.7, 50.4, 4~
## $ high     <dbl> 48.2, 47.4, 48.1, 51.2, 51.9, 51.8, 51.2, 51.8, 50.9, 50.5, 4~
## $ low      <dbl> 47.0, 45.3, 45.6, 47.2, 50.0, 50.0, 50.2, 50.4, 50.2, 49.5, 4~
## $ close    <dbl> 47.2, 45.3, 46.6, 50.4, 51.4, 51.1, 50.2, 50.4, 50.4, 49.7, 4~
## $ volume   <dbl> 2173400, 2713800, 3699400, 5975800, 4101200, 3863800, 2357600~
## $ adjusted <dbl> 27.2, 26.1, 26.9, 29.0, 29.6, 29.4, 28.9, 29.0, 29.0, 28.6, 2~
  • Financial performance analysis depends on returns. If I buy a stock today for $100 and sell it tomorrow for $101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns")  

#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns") 


#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns")
  • Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.
table_monthly_returns <- myStocks_returns_monthly %>% 
  group_by(symbol) %>% 
  summarise(
                mean_monthly_returns = mean(monthly_returns),
                min_ret=min(monthly_returns),
                max_ret = max(monthly_returns),
                mean_ret = mean(monthly_returns),
                median_ret = median(monthly_returns),
                sd_ret = sd(monthly_returns))
                
table_monthly_returns
## # A tibble: 31 x 7
##    symbol mean_monthly_returns min_ret max_ret mean_ret median_ret sd_ret
##    <chr>                 <dbl>   <dbl>   <dbl>    <dbl>      <dbl>  <dbl>
##  1 AAPL                0.0269   -0.577   0.454  0.0269     0.0338  0.115 
##  2 AMGN                0.00826  -0.170   0.328  0.00826    0.00936 0.0733
##  3 AXP                 0.0101   -0.279   0.875  0.0101     0.0112  0.0922
##  4 BA                  0.0127   -0.458   0.459  0.0127     0.0167  0.0926
##  5 CAT                 0.0143   -0.353   0.350  0.0143     0.0133  0.0899
##  6 CRM                 0.0263   -0.360   0.403  0.0263     0.0199  0.111 
##  7 CSCO                0.00611  -0.367   0.389  0.00611    0.0117  0.0963
##  8 CVX                 0.00852  -0.224   0.273  0.00852    0.0101  0.0653
##  9 DIS                 0.0105   -0.268   0.234  0.0105     0.00995 0.0737
## 10 DOW                 0.0149   -0.276   0.255  0.0149     0.0367  0.111 
## # ... with 21 more rows

Comment:

The Dow Jones Industrial Average is a price-weighted index composed of 30 of the most important publicly traded companies in the United States. Because weights depend on share price rather than company size, a company with a smaller weighting can have a larger market capitalization. For example, Honeywell, with a 4.29% weighting, has a market capitalization of $152 billion, while Apple, with a weighting of only 2.78%, has a total market capitalization of $2.43 trillion.

  • Plot a density plot, using geom_density(), for each of the stocks
density_monthly_returns <- 
  ggplot(myStocks_returns_monthly, aes(x = monthly_returns)) +
  geom_density()+
  facet_wrap(~symbol)+
  theme_bw()+
  labs(title = "Monthly returns for each stock",
       subtitle="Density Plot",
       x = "Monthly Returns",
       y = "Density") + 
       NULL

density_monthly_returns

What can you infer from this plot? Which stock is the riskiest? The least risky?

TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.

Stocks such as AAPL, DOW and INTC have the largest standard deviations in the previous table; given the cyclical nature of their businesses, they carry more risk, and their monthly-return distributions are visibly wider, with fatter tails. Non-cyclical stocks such as PG, KO and JNJ tend to underperform cyclical stocks in a bull market, but their return distributions are narrower and more peaked, with fewer extreme months, so they can be considered safer.
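The risk ranking described above can be read off the summary table. A minimal sketch, assuming table_monthly_returns from the previous question:

```r
library(dplyr)

# order stocks by monthly-return volatility and keep the two extremes
table_monthly_returns %>%
  select(symbol, mean_ret, sd_ret) %>%
  arrange(desc(sd_ret)) %>%
  slice(c(1, n()))   # riskiest first, least risky last
```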

  • Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock
library(ggrepel)


risk_return_plot <- 
ggplot(table_monthly_returns, aes(x = sd_ret , y = mean_monthly_returns))+
  geom_point()+
  geom_smooth()+
  theme_bw()+
  labs(title = "Relationship between monthly returns and standard deviation",
       subtitle="Scatter Plot",
       x = "Standard Deviation",
       y = "Monthly Returns") + 
       geom_text_repel(data = table_monthly_returns, aes(x = sd_ret , y = mean_monthly_returns, label = symbol))+
       NULL

risk_return_plot

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?

TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.

The plot shows each stock's mean monthly return against its standard deviation. SPY, the S&P 500 ETF, has the lowest standard deviation yet delivers returns similar to far riskier individual stocks, consistent with the idea that diversifiable (unsystematic) risk is not rewarded with higher returns. Overall, the relationship between return and risk is weakly positive at best. Excluding outliers such as AAPL and CRM, most stocks' returns do not rise with risk: JPM and CSCO, for example, are considerably riskier than MCD or PG, yet display similar returns.
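The "riskier but not better rewarded" stocks can be made explicit with a crude return-per-unit-of-risk ratio. A minimal sketch, assuming table_monthly_returns from above:

```r
library(dplyr)

# lowest ratios = stocks whose extra risk is least compensated by return
table_monthly_returns %>%
  mutate(return_per_risk = mean_ret / sd_ret) %>%
  select(symbol, mean_ret, sd_ret, return_per_risk) %>%
  arrange(return_per_risk)
```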

4 Is inflation transitory?

The surge in inflation seen across major economies is probably short lived because it’s confined to just a few sectors of the economy, according to the Bank for International Settlements.

New research by the BIS’s Claudio Borio, Piti Disyatat, Egon Zakrajsek and Dora Xia adds to one of the hottest debates in economics – how long the current surge in consumer prices will last. Both Federal Reserve Chair Jerome Powell and his euro-area counterpart Christine Lagarde have said the pickup is probably transitory, despite a snarled global supply chain and a spike in energy prices.

You have to download data for CPI and the 10-year Treasury yield and produce the following graph

The relevant indicators from FRED are:

cpi<-   tq_get("CPIAUCSL", get = "economic.data",
                       from = "1980-01-01") %>% 
  rename(cpi = symbol,  # FRED data is given as 'symbol' and 'price'
         rate = price) %>% # we rename them to what they really are, e.g., cpi and rate
  
  # calculate yearly change in CPI by dividing current month by same month a year (or 12 months) earlier, minus 1
  mutate(cpi_yoy_change = rate/lag(rate, 12) - 1)

ten_year_monthly  <-   tq_get("GS10", get = "economic.data",
                       from = "1980-01-01") %>% 
  rename(ten_year = symbol,
         yield = price) %>% 
  mutate(yield = yield / 100) # original data is not given as, e.g., 0.05, but rather 5, for five percent

# we have the two dataframes-- we now need to join them, and we will use left_join()
# base R has a function merge() that does the same, but it's slow, so please don't use it

mydata <- 
  cpi %>% 
  left_join(ten_year_monthly, by="date") %>% 
  mutate(
    year = year(date), # using lubridate::year() to generate a new column with just the year
    month = month(date, label = TRUE),
    decade=case_when(
      year %in% 1980:1989 ~ "1980s",
      year %in% 1990:1999 ~ "1990s",
      year %in% 2000:2009 ~ "2000s",
      year %in% 2010:2019 ~ "2010s",
      TRUE ~ "2020s"
      )
  )

mydata
## # A tibble: 500 x 9
##    cpi      date        rate cpi_yoy_change ten_year  yield  year month decade
##    <chr>    <date>     <dbl>          <dbl> <chr>     <dbl> <dbl> <ord> <chr> 
##  1 CPIAUCSL 1980-01-01  78               NA GS10     0.108   1980 Jan   1980s 
##  2 CPIAUCSL 1980-02-01  79               NA GS10     0.124   1980 Feb   1980s 
##  3 CPIAUCSL 1980-03-01  80.1             NA GS10     0.128   1980 Mar   1980s 
##  4 CPIAUCSL 1980-04-01  80.9             NA GS10     0.115   1980 Apr   1980s 
##  5 CPIAUCSL 1980-05-01  81.7             NA GS10     0.102   1980 May   1980s 
##  6 CPIAUCSL 1980-06-01  82.5             NA GS10     0.0978  1980 Jun   1980s 
##  7 CPIAUCSL 1980-07-01  82.6             NA GS10     0.102   1980 Jul   1980s 
##  8 CPIAUCSL 1980-08-01  83.2             NA GS10     0.111   1980 Aug   1980s 
##  9 CPIAUCSL 1980-09-01  83.9             NA GS10     0.115   1980 Sep   1980s 
## 10 CPIAUCSL 1980-10-01  84.7             NA GS10     0.118   1980 Oct   1980s 
## # ... with 490 more rows
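A side note on the case_when() ladder above: because every decade label follows the same pattern, it could also be derived arithmetically. A hypothetical equivalent:

```r
library(dplyr)

tibble(year = c(1983, 1999, 2007, 2015, 2021)) %>%
  # integer-divide by 10, multiply back, and append "s": 1983 -> "1980s"
  mutate(decade = paste0(year %/% 10 * 10, "s"))
```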
library(ggrepel)

my_plot <- ggplot(mydata, aes(x = cpi_yoy_change, y = yield, color = decade)) +
            geom_point() +
            geom_smooth(method = lm) +
            theme_bw() +
            theme(legend.position = "none") +
            facet_wrap(~decade, nrow = 5, scales = "free") +
            labs(title = "How are CPI and 10-year yield related?",
                 x = "CPI Yearly Change",
                 y = "10-year Treasury Constant Maturity Rate") +
            theme(axis.title.x = element_text(face = "bold"),
                  axis.title.y = element_text(face = "bold")) +
            scale_y_continuous(labels = scales::percent) +
            scale_x_continuous(labels = scales::percent) +
            # build the label inside aes() so it comes from the plot data,
            # rather than reaching back into mydata$ columns
            geom_text_repel(aes(label = paste(month, year))) +
            NULL
            
            
my_plot            

Comment:

There is a strong positive correlation between inflation and Treasury yields, but this breaks down in the 2000s around the 2008 global financial crisis, when Treasury yields stayed high while inflation turned negative. The crisis drove consumer sentiment down drastically, yet investors were still eager to buy 10-year Treasury bonds: they retained long-term faith in the economy and saw the bonds as a safe-haven asset. The positive correlation is especially visible in 2021, as both yields and inflation have recovered from the deep lows of 2020. Moreover, we can observe a general downward trend in yields over time: after the 2008 financial crisis, we entered a period of quantitative easing which greatly decreased overall interest rates. For the 2020s panel, the graph has a reduced range for the y-variable, "10-year Treasury Constant Maturity Rate", and positive levels for the x-variable, inflation. It is paramount to bear in mind that the 2020s panel contains only 2020 and 2021, while the other decades have ten full years of data.

5 Challenge 1: Replicating a chart

Data Source:

  1. To get vaccination by county, we will use data from the CDC
  2. You need to get County Presidential Election Returns 2000-2020
  3. Finally, you also need an estimate of the population of each county
library(scales)
library(ggpubr)

trump_vote <- election2020_results %>% 
  filter(mode == "TOTAL" | mode == "ELECTION DAY") %>% 
  select(fips, candidate, candidatevotes, totalvotes) %>% 
  filter(candidate == "DONALD J TRUMP") %>% 
  mutate(trump_vote = candidatevotes / totalvotes) %>% 
  group_by(fips) %>% 
  summarise(votes = max(trump_vote))

fully_vaccinated <- vaccinations %>% 
  # the first two characters of the date string give the month
  mutate(month = substring(date, 1, 2)) %>% 
  filter(month == "08", series_complete_pop_pct != 0) %>% 
  group_by(fips) %>% 
  summarise(max_complete = max(series_complete_yes))
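
The substring(date, 1, 2) trick assumes the CDC dates arrive as "mm/dd/yyyy" strings. Under that assumption, lubridate offers a parse-first alternative that does not depend on character positions:

```r
library(lubridate)

d <- "08/15/2021"        # hypothetical date value in the assumed format
substring(d, 1, 2)       # "08" -- the string-slicing approach used above
month(mdy(d))            # 8   -- parse to a Date first, then extract the month
```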

table_new <- population %>% 
  left_join(trump_vote, by = "fips") %>% 
  left_join(fully_vaccinated, by = "fips") %>% 
  mutate(pct = max_complete / pop_estimate_2019)
 
table_new <- na.omit(table_new)

# a few counties report shares above 1; cap both at 100%
table_new[table_new$votes > 1, 3] <- 1 
table_new[table_new$pct > 1, 5] <- 1
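
The two index assignments above cap the shares at 1 by column position (columns 3 and 5). Assuming those columns are votes and pct, an equivalent pipeline-friendly version uses pmin():

```r
library(dplyr)

# cap both shares at 100%, referring to columns by name rather than position
table_new <- table_new %>%
  mutate(votes = pmin(votes, 1),
         pct   = pmin(pct, 1))
```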

vaccination_plot <-  ggplot(table_new,aes(x=votes,y=pct, size=pop_estimate_2019))+
                      geom_point(color= "deepskyblue", alpha=0.5)+
                      geom_smooth(method=lm, se= FALSE )+
                      geom_hline(yintercept = 0.508, linetype = "dashed") + 
                      # annotate() draws each label once; geom_text() would redraw it for every row
                      annotate("text", x = 0, y = 0.508, label = "ACTUAL 50.8%", hjust = 0.2, vjust = -0.5, size = 2.5) +
                      geom_hline(yintercept = 0.539, linetype = "dashed") +
                      annotate("text", x = 0, y = 0.539, label = "TARGET 53.9%", hjust = 0.2, vjust = -0.5, size = 2.5) +
                      geom_hline(yintercept = 0.85, linetype = "dashed") +
                      annotate("text", x = 0, y = 0.85, label = "Herd Immunity Threshold (?)", hjust = 0.1, vjust = -0.5, size = 2.5) +
                      scale_size(range = c(.1, 24))+
                      scale_y_continuous(breaks = seq (0, 1, by = 0.05), limits= c(0,1), labels=percent_format(accuracy=1)) +
                      scale_x_continuous(breaks = seq (0, 1, by = 0.05), limits= c(0,1),labels=percent_format(accuracy=1)) +
                      labs(title = "COVID-19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY",
                      subtitle= "(most states based on FULLY vaccinated only; CA, GA, IA, MI & TX based on total doses administered)
                      Data via Centers for Disease Control, COVID Act Now, state health depts
                      Graph by Charles Gaba / ACASignups.net",
                      x = "2020 Trump Vote %", 
                      y = "% of Total Population Vaccinated") +
                      annotate("text",x=-Inf,y=Inf,hjust=-1.5,vjust=2.3,label="EVERY U.S. COUNTY") +
                      theme_bw()+
                      theme(legend.position= "none", plot.title = element_text(hjust=0.5), plot.subtitle = element_text(hjust=0.5), aspect.ratio=0.95)+
                      stat_regline_equation(label.x = 0, label.y = 0.035) +
                      stat_cor(aes(label = after_stat(rr.label)), label.x = 0, label.y = 0) +
                      NULL

vaccination_plot                    

Comment:

There is a slight negative correlation between the % voting for Trump and the % of the total population vaccinated. This makes sense, as Trump’s policies were notably COVID-sceptic and vaccine-sceptic, so counties that voted for him tend to exhibit the same tendencies. There are 77 majority non-Trump counties below the line of best fit. Of these, 86% are majority non-white, and the general level of wealth is lower: many of them are in Alabama and Mississippi, whose GDP per capita is $41,000 and $35,000 respectively, against a US average of $61,000, putting these counties at least 33% below the average. This lower level of wealth probably means less access to good-quality education, which is a key factor in getting vaccinated. Without these counties, the graph would show a much stronger negative correlation between the % voting for Trump and the % of the total population vaccinated.

6 Challenge 2: Opinion polls for the 2021 German elections

The Guardian newspaper has an election poll tracker for the upcoming German election. The list of opinion polls since Jan 2021 can be found on Wikipedia, and your task is to reproduce a graph similar to the one produced by the Guardian.

The following code will scrape the Wikipedia page and import the table in a dataframe.

url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"

# similar graphs and analyses can be found at 
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel


# get tables that exist on wikipedia page 
tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")


# parse HTML tables into a dataframe called polls 
# use purrr::map() to apply html_table() to every table on the page
polls <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())

polls
## [[1]]
## # A tibble: 252 x 13
##    polling_firm    fieldwork_date samplesize abs   union   spd  af_d   fdp linke
##    <chr>           <chr>          <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Polling firm    Fieldwork date Samplesize "Abs~  NA    NA      NA  NA    NA  
##  2 2021 federal e~ 26 Sep 2021    –          ""     NA    NA      NA  NA    NA  
##  3 Wahlkreisprogn~ 22–24 Sep 2021 1,400      "–"    22.5  25.5    11  12     7  
##  4 Ipsos           22–23 Sep 2021 2,000      "–"    22    26      11  12     7  
##  5 Forschungsgrup~ 22–23 Sep 2021 1,273      "–"    23    25      10  11     6  
##  6 Forsa           20–23 Sep 2021 2,002      "26"   22    25      10  12     6  
##  7 Allensbach      16–23 Sep 2021 1,554      "–"    25    26      10  10.5   5  
##  8 Civey           16–23 Sep 2021 10,012     "–"    23    25      10  12     6  
##  9 YouGov          16–22 Sep 2021 2,364      "–"    21    25      12  11     7  
## 10 Wahlkreisprogn~ 20–21 Sep 2021 1,801      "–"    21.5  25      11  12.5   6.5
## # ... with 242 more rows, and 4 more variables: grune <dbl>, fw <chr>,
## #   others <chr>, lead <chr>
## 
## [[2]]
## # A tibble: 226 x 12
##    polling_firm    fieldwork_date samplesize abs   union   spd  af_d   fdp linke
##    <chr>           <chr>          <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Polling firm    Fieldwork date Samplesize Abs.     NA  NA      NA    NA  NA  
##  2 Forsa           21–23 Dec 2020 1,501      –        36  15       9     6   9  
##  3 INSA            18–21 Dec 2020 2,055      –        35  16      11     8   7.5
##  4 Forsa           14–18 Dec 2020 2,505      23       37  15       8     6   8  
##  5 Kantar          10–16 Dec 2020 2,401      –        35  17      10     6   8  
##  6 INSA            11–14 Dec 2020 2,002      –        36  17      10     7   7.5
##  7 Forsa           7–11 Dec 2020  2,503      23       37  15       8     5   8  
##  8 Allensbach      28 Nov–10 Dec~ 1,022      –        37  16.5     9     7   7  
##  9 Infratest dimap 7–9 Dec 2020   1,004      –        36  16       9     6   7  
## 10 Forschungsgrup~ 7–9 Dec 2020   1,246      25       37  16      10     5   8  
## # ... with 216 more rows, and 3 more variables: grune <dbl>, others <chr>,
## #   lead <chr>
## 
## [[3]]
## # A tibble: 234 x 12
##    polling_firm    fieldwork_date samplesize abs   union   spd  af_d   fdp linke
##    <chr>           <chr>          <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Polling firm    Fieldwork date Samplesize Abs.     NA    NA    NA  NA    NA  
##  2 INSA            20–23 Dec 2019 2,034      –        28    13    15  10     8  
##  3 Forsa           16–20 Dec 2019 2,501      23       28    13    13   8     8  
##  4 Emnid           12–18 Dec 2019 1,899      –        27    15    14   9     9  
##  5 YouGov          13–17 Dec 2019 1,586      –        27    13    15   7    10  
##  6 INSA            13–16 Dec 2019 2,020      –        27    13    15   8.5   9.5
##  7 Forsa           9–13 Dec 2019  2,502      22       28    12    13   8     9  
##  8 Forschungsgrup~ 10–12 Dec 2019 1,366      –        27    13    14   8     9  
##  9 Infratest dimap 10–11 Dec 2019 1,038      –        27    14    15   8     8  
## 10 Emnid           5–11 Dec 2019  1,978      –        28    16    13   9     9  
## # ... with 224 more rows, and 3 more variables: grune <dbl>, others <chr>,
## #   lead <chr>
## 
## [[4]]
## # A tibble: 253 x 12
##    polling_firm    fieldwork_date samplesize abs   union   spd  af_d   fdp linke
##    <chr>           <chr>          <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Polling firm    Fieldwork date Samplesize Abs.     NA    NA    NA  NA    NA  
##  2 YouGov          21–27 Dec 2018 1,665      –        30    16    14   8    10  
##  3 INSA            20–21 Dec 2018 1,029      –        29    14    15   9     9  
##  4 Forsa           17–21 Dec 2018 2,504      24       31    14    13   8     8  
##  5 Emnid           13–19 Dec 2018 1,889      –        29    15    14   9     9  
##  6 INSA            14–17 Dec 2018 2,077      –        29    15    15   9.5   9.5
##  7 Forsa           10–14 Dec 2018 2,507      24       32    15    12   8     8  
##  8 Forschungsgrup~ 11–13 Dec 2018 1,268      –        30    15    15   7     9  
##  9 Infratest dimap 11–12 Dec 2018 1,048      –        31    15    13   8     8  
## 10 Emnid           6–12 Dec 2018  1,997      –        30    15    14   8     8  
## # ... with 243 more rows, and 3 more variables: grune <dbl>, others <chr>,
## #   lead <chr>
## 
## [[5]]
## # A tibble: 68 x 12
##    polling_firm    fieldwork_date samplesize abs   union   spd  af_d   fdp linke
##    <chr>           <chr>          <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Polling firm    Fieldwork date Samplesize Abs.     NA  NA      NA  NA      NA
##  2 YouGov          25–27 Dec 2017 1,489      –        33  21      13   9      10
##  3 INSA            21–22 Dec 2017 2,203      –        33  20.5    13  10.5    10
##  4 Forsa           18–22 Dec 2017 2,504      21       34  19      12   8      10
##  5 Emnid           14–20 Dec 2017 1,861      –        33  21      12   8      10
##  6 INSA            15–18 Dec 2017 2,031      –        31  21      14   9      11
##  7 Forsa           11–15 Dec 2017 2,501      21       33  20      12   8      10
##  8 Allensbach      1–14 Dec 2017  1,443      –        34  21      11  10       9
##  9 Infratest dimap 11–13 Dec 2017 1,029      –        32  20      13   9       9
## 10 Emnid           7–13 Dec 2017  1,865      –        32  22      13   8       9
## # ... with 58 more rows, and 3 more variables: grune <dbl>, others <chr>,
## #   lead <chr>
## 
## [[6]]
## # A tibble: 211 x 12
##    polling_firm fieldwork_date    samplesize   cdu   spd  af_d   fdp linke grune
##    <chr>        <chr>             <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Polling firm Fieldwork date    Samplesize  NA    NA    NA    NA    NA    NA  
##  2 INSA         17–20 Sep 2021    2,054       17    25    11    12     6.5  15  
##  3 INSA         10–13 Sep 2021    2,062       16    26    11.5  12.5   6.5  15  
##  4 INSA         3–6 Sep 2021      2,052       15.5  26    11    12.5   6.5  15.5
##  5 INSA         27–30 Aug 2021    2,015       15    25    11    13.5   7    16.5
##  6 INSA         20–23 Aug 2021    2,119       18    23    11    13     7    17  
##  7 INSA         13–16 Aug 2021    2,080       20    20    11    12.5   6.5  17.5
##  8 INSA         6–9 Aug 2021      2,118       20.5  17.5  11.5  12.5   6.5  17.5
##  9 INSA         30 Jul–2 Aug 2021 2,080       22.5  18    11    13     7    18  
## 10 INSA         23–26 Jul 2021    2,007       22    17.5  12    13     6    17.5
## # ... with 201 more rows, and 3 more variables: csu <dbl>, others <chr>,
## #   lead <chr>
## 
## [[7]]
## # A tibble: 4 x 11
##   polling_firm    fieldwork_date  samplesize   cdu   spd grune   fdp  af_d linke
##   <chr>           <chr>           <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm    Fieldwork date  Samplesize  NA    NA    NA    NA    NA    NA  
## 2 Forsa           29 Jan – 1 Feb~ 1,007       33    10    24    10    12     6  
## 3 Forsa           8–22 Feb 2018   1,003       35    13    17    11    12     8  
## 4 2017 federal e~ 24 Sep 2017     –           34.4  16.4  13.5  12.7  12.2   6.4
## # ... with 2 more variables: others <chr>, lead <chr>
## 
## [[8]]
## # A tibble: 24 x 12
##    polling_firm    fieldwork_date samplesize   csu   spd  af_d   fdp grune linke
##    <chr>           <chr>          <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Polling firm    Fieldwork date Samplesize    NA    NA    NA    NA    NA    NA
##  2 GMS             13–15 Sep 2021 1,002         29    17    11    13    16     5
##  3 GMS             8–13 Sep 2021  1,005         28    18    11    12    17     4
##  4 Infratest dimap 2–6 Sep 2021   1,195         28    18    10    12    16     3
##  5 GMS             1–6 Sep 2021   1,003         29    15    10    13    18     3
##  6 GMS             21–27 Jul 2021 1,003         35     9     9    12    20     4
##  7 INSA            12–19 Jul 2021 1,000         36    11     9     9    19     4
##  8 Infratest dimap 2–6 Jul 2021   1,186         36     9    10    11    18     4
##  9 GMS             16–21 Jun 2021 1,007         34    10     9    11    22     3
## 10 GMS             11–17 May 2021 1,003         32     8    10    11    26     4
## # ... with 14 more rows, and 3 more variables: free_voters <chr>, others <chr>,
## #   lead <chr>
## 
## [[9]]
## # A tibble: 30 x 11
##    polling_firm fieldwork_date    samplesize   cdu linke   spd grune  af_d   fdp
##    <chr>        <chr>             <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Polling firm Fieldwork date    Samplesize    NA    NA    NA    NA    NA    NA
##  2 Forsa        29 Jan – 6 Feb 2~ 1,001         21    16    12    26    10     6
##  3 Forsa        12–19 Dec 2019    1,005         21    16    13    24    10     7
##  4 Forsa        21–26 Nov 2019    1,006         20    16    14    24    11     7
##  5 Forsa        22–31 Oct 2019    1,002         22    15    13    24    11     6
##  6 Forsa        17–26 Sep 2019    1,002         22    15    13    24    11     7
##  7 Forsa        20–29 Aug 2019    1,003         22    17    14    24    10     7
##  8 Forsa        17–25 Jul 2019    1,001         22    17    14    24    10     6
##  9 Forsa        17–27 Jun 2019    1,004         21    16    14    24    11     7
## 10 Forsa        20–27 May 2019    1,006         19    16    14    25    12     7
## # ... with 20 more rows, and 2 more variables: others <chr>, lead <chr>
## 
## [[10]]
## # A tibble: 6 x 11
##   polling_firm     fieldwork_date samplesize   cdu  af_d   spd linke   fdp grune
##   <chr>            <chr>          <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm     Fieldwork date Samplesize  NA    NA    NA    NA    NA      NA
## 2 Infratest dimap  25–28 Aug 2021 1,157       15    18    29    11     9       9
## 3 Forsa            10–15 Dec 2020 1,001       28    17    15    12     4      17
## 4 Forsa            17–20 Dec 2018 1,005       23    20    12    18     7      14
## 5 Forsa            7–9 Nov 2017   1,002       26    20    16    19     8       6
## 6 2017 federal el~ 24 Sep 2017    –           26.7  20.2  17.6  17.2   7.1     5
## # ... with 2 more variables: others <chr>, lead <chr>
## 
## [[11]]
## # A tibble: 8 x 11
##   polling_firm    fieldwork_date  samplesize   cdu   spd grune linke   fdp  af_d
##   <chr>           <chr>           <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm    Fieldwork date  Samplesize  NA    NA    NA    NA    NA    NA  
## 2 Trend Research  13–16 Sep 2021  703         14    33    18     9    14     7  
## 3 Trend Research  30 Aug–2 Sep 2~ 708         15    34    17    10    13     7  
## 4 Trend Research  12–18 Aug 2021  702         17    28    19     9    14     7  
## 5 Forsa           18 Dec 2019–6 ~ 1,069       21    16    29    11     8     9  
## 6 Universität Ha~ 6 Jan–2 Mar 20~ 1,069       20    24    32     8    10     3  
## 7 Forsa           27 Dec 2018–3 ~ 1,004       22    17    26    11    10     9  
## 8 2017 federal e~ 24 Sep 2017     –           27.2  23.5  13.9  12.2  10.8   7.8
## # ... with 2 more variables: others <chr>, lead <chr>
## 
## [[12]]
## # A tibble: 3 x 11
##   polling_firm     fieldwork_date samplesize   cdu   spd  af_d   fdp grune linke
##   <chr>            <chr>          <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm     Fieldwork date Samplesize  NA    NA    NA    NA    NA    NA  
## 2 Forsa            8–22 Feb 2018  1,035       31    19    10    12    14    10  
## 3 2017 federal el~ 24 Sep 2017    –           30.9  23.5  11.9  11.5   9.7   8.1
## # ... with 2 more variables: others <chr>, lead <chr>
## 
## [[13]]
## # A tibble: 16 x 11
##    polling_firm    fieldwork_date samplesize   cdu  af_d linke   spd   fdp grune
##    <chr>           <chr>          <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Polling firm    Fieldwork date Samplesize  NA    NA    NA    NA    NA    NA  
##  2 INSA            10–16 Sep 2021 1,000       14    19    11    38     6     9  
##  3 Infratest dimap 2–7 Sep 2021   1,180       16    18    11    31     9     8  
##  4 Infratest dimap 19–24 Aug 2021 1,153       19    18    12    29     9     7  
##  5 INSA            6–12 Aug 2021  1,000       21    18    12    25     8    10  
##  6 INSA            16–22 Jul 2021 1,098       26    21    13    17     8    11  
##  7 Infratest dimap 8–13 Jul 2021  1,159       30    17    13    17     8     8  
##  8 Infratest dimap 12–18 May 2021 1,245       23    16    12    18     7    16  
##  9 Forsa           12–15 Jan 2021 1,002       32    15    16    12     5    13  
## 10 Infratest dimap 18–21 Nov 2020 1,000       34    15    12    20     4    10  
## 11 Forsa           2–10 Jan 2020  1,002       27    20    16     8     6    14  
## 12 Forsa           18–23 Sep 2019 1,002       28    21    14    10     6    14  
## 13 Forsa           4–11 Jan 2019  1,007       32    20    16    11     5    11  
## 14 Forsa           24–28 Jun 2018 1,002       30    21    19    12     5     8  
## 15 Forsa           2–11 Jan 2018  1,001       33    18    18    15     5     6  
## 16 2017 federal e~ 24 Sep 2017    –           33.1  18.6  17.8  15.1   6.2   4.3
## # ... with 2 more variables: others <chr>, lead <chr>
## 
## [[14]]
## # A tibble: 5 x 11
##   polling_firm     fieldwork_date samplesize   cdu   spd   fdp  af_d grune linke
##   <chr>            <chr>          <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm     Fieldwork date Samplesize  NA    NA    NA    NA    NA      NA
## 2 Forsa            19–28 May 2020 1,002       40    20     5     7    16       6
## 3 Forsa            1–8 Feb 2019   1,010       33    19     8     9    19       7
## 4 Forsa            8–22 Feb 2018  1,004       36    24     9     8    12       7
## 5 2017 federal el~ 24 Sep 2017    –           34.9  27.4   9.3   9.1   8.7     7
## # ... with 2 more variables: others <chr>, lead <chr>
## 
## [[15]]
## # A tibble: 6 x 11
##   polling_firm     fieldwork_date samplesize   cdu   spd   fdp  af_d grune linke
##   <chr>            <chr>          <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm     Fieldwork date Samplesize  NA      NA  NA    NA    NA    NA  
## 2 Forsa            10–17 May 2021 1,058       24      18  11     8    28     4  
## 3 Forsa            Mar–Apr 2021   1,090       26      18  11     8    25     6  
## 4 Forsa            1–16 Aug 2019  1,505       27      17  10     9    25     6  
## 5 Forsa            8–22 Feb 2018  1,015       34      21  11     9    11     8  
## 6 2017 federal el~ 24 Sep 2017    –           32.6    26  13.1   9.4   7.6   7.5
## # ... with 2 more variables: others <chr>, lead <chr>
## 
## [[16]]
## # A tibble: 7 x 12
##   polling_firm     fieldwork_date samplesize   cdu   spd  af_d   fdp grune linke
##   <chr>            <chr>          <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm     Fieldwork date Samplesize  NA    NA    NA    NA    NA    NA  
## 2 Infratest dimap  2–7 Sep 2021   1,160       23    30    10    11    13     4  
## 3 Infratest dimap  9–13 Jul 2021  1,153       30    22    11     9    17     4  
## 4 Forsa            1–8 Feb 2019   1,005       34    16    10    10    18     7  
## 5 Infratest dimap  5–7 Mar 2018   1,001       34    23    12     9    12     7  
## 6 Infratest dimap  8–12 Dec 2017  1,003       39    24     9     8    12     5  
## 7 2017 federal el~ 24 Sep 2017    –           35.9  24.2  11.2  10.4   7.6   6.8
## # ... with 3 more variables: free_voters <chr>, others <chr>, lead <chr>
## 
## [[17]]
## # A tibble: 5 x 11
##   polling_firm     fieldwork_date samplesize  af_d   cdu linke   spd   fdp grune
##   <chr>            <chr>          <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm     Fieldwork date Samplesize    NA  NA    NA    NA    NA    NA  
## 2 INSA             6–13 Sep 2021  1,000         26  18    11    18    11     8  
## 3 Infratest dimap  13–18 Aug 2021 1,179         23  21    11    15    12    10  
## 4 INSA             2–9 Aug 2021   1,001         25  24    13    10    11    10  
## 5 2017 federal el~ 24 Sep 2017    –             27  26.9  16.1  10.5   8.2   4.6
## # ... with 2 more variables: others <chr>, lead <chr>
## 
## [[18]]
## # A tibble: 3 x 11
##   polling_firm     fieldwork_date samplesize   cdu  af_d linke   spd   fdp grune
##   <chr>            <chr>          <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Polling firm     Fieldwork date Samplesize  NA    NA    NA    NA    NA    NA  
## 2 INSA             16–23 Aug 2021 1,000       18    22    18    21     9     5  
## 3 2017 federal el~ 24 Sep 2017    –           28.8  22.7  16.9  13.2   7.8   4.1
## # ... with 2 more variables: others <chr>, lead <chr>
## 
## [[19]]
## # A tibble: 69 x 25
##    polling_firm       fieldwork_date  samplesize spd_union spd_union_2 union_fdp
##    <chr>              <chr>           <chr>          <int>       <int> <chr>    
##  1 Polling firm       Fieldwork date  Samplesize        NA          NA ""       
##  2 Forschungsgruppe ~ 14–16 Sep 2021  1,406              9           9 "8"      
##  3 Forschungsgruppe ~ 7–9 Sep 2021    1,281              9           9 "7"      
##  4 Forschungsgruppe ~ 31 Aug–2 Sep 2~ 1,301             12          12 "8"      
##  5 Forschungsgruppe ~ 24–26 Aug 2021  1,300             10          10 "6"      
##  6 Forschungsgruppe ~ 10–12 Aug 2021  1,252             12          12 "10"     
##  7 Forschungsgruppe ~ 27–29 Jul 2021  1,268             13          13 "10"     
##  8 Forschungsgruppe ~ 13–15 Jul 2021  1,224             13          13 "11"     
##  9 Forschungsgruppe ~ 22–24 Jun 2021  1,271             11          11 "11"     
## 10 Civey              8–9 Jun 2021    5,005              6           6 "–"      
## # ... with 59 more rows, and 19 more variables: union_fdp_2 <chr>,
## #   spd_grune <chr>, spd_grune_2 <chr>, union_grune <int>, union_grune_2 <int>,
## #   spd_grune_linke <int>, spd_grune_linke_2 <int>, spd_grune_linke_3 <int>,
## #   union_grune_fdp <int>, union_grune_fdp_2 <int>, union_grune_fdp_3 <int>,
## #   spd_union_fdp <chr>, spd_union_fdp_2 <chr>, spd_union_fdp_3 <chr>,
## #   spd_grune_fdp <chr>, spd_grune_fdp_2 <chr>, spd_grune_fdp_3 <chr>,
## #   spdfdp <chr>, spdfdp_2 <chr>
# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinion polls
  slice(2:(n()-1)) %>%  # drop the first row (it repeats the variable names) and the last row (the 2017 election result)
  mutate(
         # polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
         # and we extract it by picking the last 11 characters from that field
         end_date = str_sub(fieldwork_date, -11),
         
         # end_date is still a string, so we convert it into a date object using lubridate::dmy()
         end_date = dmy(end_date),
         
         # we also get the month and week number from the date, if we want to do analysis by month- week, etc.
         month = month(end_date),
         week = isoweek(end_date)
         )
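
To see what the str_sub() and dmy() steps are doing, apply them to a typical fieldwork_date value (note the en dash inside the day range):

```r
library(stringr)
library(lubridate)

x <- "22–24 Sep 2021"    # a typical fieldwork_date entry
str_sub(x, -11)          # "24 Sep 2021" -- the last 11 characters
dmy(str_sub(x, -11))     # parsed into the Date 2021-09-24
```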
         
library(plotly)

german_election_polls_new <- german_election_polls %>% 
  arrange(end_date) %>%
  group_by(end_date) %>% 
  summarise(CDU_mean = mean(union),
            AFD_mean = mean(af_d),
            SPD_mean = mean(spd),
            FDP_mean = mean(fdp),
            Linke_mean = mean(linke),
            Grune_mean = mean(grune))
            
# calculate a 14-observation rolling average (k = 14 rows, i.e. 14 distinct
# poll dates, which is only approximately 14 calendar days given gaps in the series)

election_14_rolling_mean <- german_election_polls_new %>%   # this is a new object
  mutate(CDU = zoo::rollmean(CDU_mean, k = 14, fill = NA),
         AFD = zoo::rollmean(AFD_mean, k = 14, fill = NA),
         SPD = zoo::rollmean(SPD_mean, k = 14, fill = NA),
         FDP = zoo::rollmean(FDP_mean, k = 14, fill = NA),
         Linke = zoo::rollmean(Linke_mean, k = 14, fill = NA),
         Grune = zoo::rollmean(Grune_mean, k = 14, fill = NA)) 
    
election_14_rolling_mean
## # A tibble: 146 x 13
##    end_date   CDU_mean AFD_mean SPD_mean FDP_mean Linke_mean Grune_mean   CDU
##    <date>        <dbl>    <dbl>    <dbl>    <dbl>      <dbl>      <dbl> <dbl>
##  1 2021-01-04     36.5       10     15.5     6.75       7.75       18    NA  
##  2 2021-01-05     36         10     15       6          9          18    NA  
##  3 2021-01-06     35         10     14       7          7          21    NA  
##  4 2021-01-08     36          8     14       7          8          20    NA  
##  5 2021-01-11     36         10     15       7.5        8          18    NA  
##  6 2021-01-13     36         10     15       7          8          18    NA  
##  7 2021-01-14     37         10     15       5          8          20    35.9
##  8 2021-01-15     35          9     15       6          8          20    35.8
##  9 2021-01-17     35          9     15       7          8          19    35.7
## 10 2021-01-18     35.2       11     14.5     8.75       7.75       17.2  35.7
## # ... with 136 more rows, and 5 more variables: AFD <dbl>, SPD <dbl>,
## #   FDP <dbl>, Linke <dbl>, Grune <dbl>
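One caveat about the rolling average above: k = 14 in zoo::rollmean() means 14 consecutive rows, i.e. 14 distinct poll dates, which is only approximately 14 calendar days because the series has gaps. If an exact calendar window were wanted, one option (a sketch, assuming the slider package is installed) is slide_index_dbl(), which windows on the date index:

```r
library(dplyr)
library(lubridate)
library(slider)

# hypothetical alternative: a true trailing 14-calendar-day mean per party
election_14_day_window <- german_election_polls_new %>%
  mutate(across(ends_with("_mean"),
                ~ slide_index_dbl(.x, end_date, mean, .before = days(13)),
                .names = "{.col}_14d"))
```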
#form new dataframe of each party

CDU_df <- election_14_rolling_mean %>%
  select(CDU, end_date) %>%
  rename(percentage = CDU)

AFD_df <- election_14_rolling_mean %>%
  select(AFD, end_date) %>%
  rename(percentage = AFD)

SPD_df <- election_14_rolling_mean %>%
  select(SPD, end_date) %>%
  rename(percentage = SPD)

FDP_df <- election_14_rolling_mean %>%
  select(FDP, end_date) %>%
  rename(percentage = FDP)

Linke_df <- election_14_rolling_mean %>%
  select(Linke, end_date) %>%
  rename(percentage = Linke)

Grune_df <- election_14_rolling_mean %>%
  select(Grune, end_date) %>%
  rename(percentage = Grune)
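
The six nearly identical per-party dataframes above can also be produced in one step with tidyr::pivot_longer(); a single colour-mapped pair of layers would then replace the twelve geom calls below (party colours could be fixed with scale_colour_manual()):

```r
library(dplyr)
library(tidyr)

party_long <- election_14_rolling_mean %>%
  select(end_date, CDU, AFD, SPD, FDP, Linke, Grune) %>%
  pivot_longer(-end_date, names_to = "party", values_to = "percentage")

# ggplot(party_long, aes(x = end_date, y = percentage, colour = party)) +
#   geom_point(alpha = 0.3) +
#   geom_smooth(se = FALSE)
```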

  
#Plotting the data

election_polls <- ggplot(CDU_df, aes(x = end_date, y = percentage)) +
                  geom_point(color = "grey", alpha = 0.3) +
                  geom_smooth(color = "black",se = FALSE) + 
                  geom_point(data = AFD_df,color = "navyblue",alpha = 0.3) +
                  geom_smooth(data = AFD_df,color = "navyblue",se = FALSE) +
                  geom_point(data = SPD_df,color = "red3",alpha = 0.3) +
                  geom_smooth(data = SPD_df,color = "red3",se = FALSE) +
                  geom_point(data = FDP_df,color = "yellow2",alpha = 0.3) +
                  geom_smooth(data = FDP_df,color = "yellow2",se = FALSE) +
                  geom_point(data = Linke_df,color = "purple3",alpha = 0.3) +
                  geom_smooth(data = Linke_df,color = "purple3",se = FALSE) +
                  geom_point(data = Grune_df,color = "green3",alpha = 0.3) +
                  geom_smooth(data = Grune_df,color = "green3",se = FALSE) +
                  theme_bw()+
                  scale_y_continuous(breaks = seq(0,45, by = 5))+
                  labs(title = "Opinion Polling for the 2021 German Federal Election",
                  x = "Date", 
                  y = "Predicted votes (%)",
                  caption = "Source:https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election") +
                  theme(plot.caption = element_text(hjust = 1), 
                       plot.title = element_text(hjust=0.5)) +                    
                  theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank()) + 
                  NULL
         
ggplotly(election_polls)

Comment:

The CDU gained popularity in March 2020 for its policies as the pandemic started; in Spain and Peru, for example, the governing party also gained popularity at the start of the pandemic. People believed the government’s decisions about lockdowns and financial support were sound, which may explain the higher popularity. After Armin Laschet was elected chairman of Angela Merkel’s Christian Democratic party in January 2021, the CDU progressively lost popularity: he is seen as an unfavourable candidate to replace Merkel and is viewed by political critics as lacking empathy. The Social Democrats have taken the lead in the polls for the first time in 15 years. This is due to the unpopularity of the CDU, and to the fact that Germans are now very concerned about the level of indebtedness in the country; the SPD’s calls for looser fiscal policy and taxes on the rich and wealthy have attracted a lot of support among the German population.

7 Details

  • Who did you collaborate with: nobody (NA)
  • Approximately how much time did you spend on this problem set: approximately 18-20 hours
  • What, if anything, gave you the most trouble: data wrangling